Objectives
What are the 100 highest total wage zip codes? lowest? what are the 100 highest average wage zip codes? lowest? how does the 100 total wages of the zip codes differ from the 100 average wages of the zipcodes? top 3 states with the highest/lowest wages? Make clear concise graphs and compare to other types of existing data Create a machine learning model to predict future data
Load The Neccesary libraries
library(tidyverse)
library(maps)
library(mapdata)
library(plotly)
library(caTools)
library(reshape2)
Examine data
wages <- read_csv("free-zipcode-database.csv")
Rows: 81831 Columns: 20
-- Column specification ---------------------------------------------------------------------------------------------------
Delimiter: ","
chr (10): Zipcode, ZipCodeType, City, State, LocationType, WorldRegion, Country, LocationText, Location, Notes
dbl (9): RecordNumber, Lat, Long, Xaxis, Yaxis, Zaxis, TaxReturnsFiled, EstimatedPopulation, TotalWages
lgl (1): Decommisioned
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(wages)
Rows: 81,831
Columns: 20
$ RecordNumber <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26~
$ Zipcode <chr> "00704", "00704", "00704", "00704", "00704", "00704", "00704", "00704", "00705", "00705", "00~
$ ZipCodeType <chr> "STANDARD", "STANDARD", "STANDARD", "STANDARD", "STANDARD", "STANDARD", "STANDARD", "STANDARD~
$ City <chr> "PARC PARQUE", "PASEO COSTA DEL SUR", "SECT LANAUSSE", "URB EUGENE RICE", "URB GONZALEZ", "UR~
$ State <chr> "PR", "PR", "PR", "PR", "PR", "PR", "PR", "PR", "PR", "PR", "PR", "PR", "PR", "PR", "PR", "PR~
$ LocationType <chr> "NOT ACCEPTABLE", "NOT ACCEPTABLE", "NOT ACCEPTABLE", "NOT ACCEPTABLE", "NOT ACCEPTABLE", "NO~
$ Lat <dbl> 17.96, 17.96, 17.96, 17.96, 17.96, 17.96, 17.96, 17.96, 18.14, 18.14, 18.14, 18.14, 18.14, 18~
$ Long <dbl> -66.22, -66.22, -66.22, -66.22, -66.22, -66.22, -66.22, -66.22, -66.26, -66.26, -66.26, -66.2~
$ Xaxis <dbl> 0.38, 0.38, 0.38, 0.38, 0.38, 0.38, 0.38, 0.38, 0.38, 0.38, 0.38, 0.38, 0.38, 0.38, 0.38, 0.3~
$ Yaxis <dbl> -0.87, -0.87, -0.87, -0.87, -0.87, -0.87, -0.87, -0.87, -0.86, -0.86, -0.86, -0.86, -0.86, -0~
$ Zaxis <dbl> 0.30, 0.30, 0.30, 0.30, 0.30, 0.30, 0.30, 0.30, 0.31, 0.31, 0.31, 0.31, 0.31, 0.31, 0.31, 0.3~
$ WorldRegion <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ Country <chr> "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US~
$ LocationText <chr> "Parc Parque, PR", "Paseo Costa Del Sur, PR", "Sect Lanausse, PR", "Urb Eugene Rice, PR", "Ur~
$ Location <chr> "NA-US-PR-PARC PARQUE", "NA-US-PR-PASEO COSTA DEL SUR", "NA-US-PR-SECT LANAUSSE", "NA-US-PR-U~
$ Decommisioned <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA~
$ TaxReturnsFiled <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ EstimatedPopulation <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ TotalWages <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ Notes <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
Looking at United States only
wages <- wages %>% filter(Country == "US")
Removing unwanted variables
wages <- wages %>% select(-RecordNumber,-Xaxis,-Yaxis,-Zaxis,-Location,-WorldRegion,-LocationText,-LocationType,-Country,-Decommisioned,-Notes)
Removing Unwanted cases like puerto rico and NA values and duplicates
unique(wages$State)
[1] "PR" "NJ" "NY" "VI" "MA" "ME" "NH" "VT" "CT" "RI" "DE" "PA" "WV" "KY" "TN" "VA" "GA" "IN" "OH" "IL" "IA" "MN" "WI"
[24] "MT" "ND" "SD" "KS" "MO" "NE" "CO" "WY" "ID" "UT" "AZ" "NM" "TX" "CA" "NV" "OR" "WA" "AK" "GU" "HI" "AS" "PW" "FM"
[47] "MP" "MH" "FL" "SC" "AL" "MS" "LA" "AR" "OK" "MI" "DC" "MD" "NC" "AE" "AA" "AP"
wages <- wages %>% filter(State != "PR")
for(i in 1:9){
wages<- wages %>% filter(!is.na(wages[i]))
}
wages <- wages %>% distinct(Zipcode, .keep_all = TRUE)
We see a loss of about 50,000 values in the data set. About 25,000 lost from blank values and another 25,000 from duplicates. Since this is the majority of the data set, it is not a very clean data set and may not have a very accurate representation of the initial data. However, I will still plan to analyse the data in a comprehensive way in order to find answers to the proposed questions.
EXporting a csv file in order to use the cleaned data in other programs i.e. Tableau
write.table(wages,file = "CleanWages.csv",row.names = F,sep = ",")
What is the 100 highest wage zip codes?
options(scipen = 999)
highestwage <- wages %>% arrange(-TotalWages)
highestwage <- highestwage %>% mutate(rank = 1:28844)
highestwage <- highestwage %>% filter(rank <= 100)
print(head(highestwage))
highestwage %>% ggplot(aes(x = State,fill = State)) + geom_bar() + labs(title = "Number of High Wage Zip Codes Per State in the Top 100" ,x = "State",y = "Frequency")

Looking at the graph of each state in the top 100 and the Total wages of the zip codes in each state, it is easy to see that California, New York, Texas, and Illinois contribute the majority of the zip codes with the highest wages.
options(scipen = 999)
highestwage %>% ggplot(aes(x = TotalWages/1000000000)) + geom_density() + labs(title = "Density of Total Wages in the Top 100" ,x = "Total Wages (Billion $)",y = "Density")

By taking a look at the graph of the density of each price we can see that the majority of zip codes in the top 100 highest wages are around 1.7 billion dollars. Also as you get past 1.9 billion dollars the amount of zip codes decreases steadily.
What is the 100 lowest wage zip codes?
lowestwage <- wages %>% arrange(TotalWages)
lowestwage <- lowestwage %>% mutate(rank = 1:28844)
lowestwage <- lowestwage %>% filter(rank <= 100)
print(head(lowestwage))
lowestwage %>% ggplot(aes(x = State,fill = State)) + geom_bar() + labs(title = "Number of Low Wage Zip Codes Per State in the Bottom 100" ,x = "State",y = "Frequency")

Here we see that Michigan, Arizona, and Texas are the 3 most frequent states in the Bottom 100.
options(scipen = 999)
lowestwage %>% ggplot(aes(x = TotalWages/1000000)) + geom_density() + labs(title = "Density of Total Wages in the Bottom 100" ,x = "Total Wages (Million $)",y = "Density")

Similar to the last density chart, this graph predicts the majority of wages in the bottom 100 are around $4.5 million.
What is the 100 highest average wage zip codes?
highestavgwage <- wages %>% arrange(-averageWage)
highestavgwage <- highestavgwage %>% mutate(rank = 1:28844)
highestavgwage <- highestavgwage %>% filter(rank <= 100)
print(head(highestavgwage))
highestavgwage %>% ggplot(aes(x = State,fill = State)) + geom_bar() + labs(title = "Number of Average Wage Zip Codes Per State in the Top 100" ,x = "State",y = "Frequency")

The average zip code wage in each state shows a different story than the total wages in each zip code. The two states that make up the majority are New York and California, where New York is much more frequent.
options(scipen = 999)
highestavgwage %>% ggplot(aes(x = averageWage/1000)) + geom_density() + labs(title = "Density of Average Wages in the Top 100" ,x = "Total Wages (Thousand $)",y = "Density")

The densities show that the majority of the average zip code wages in the top 100 are around 190 thousand dollars.
What is the 100 lowest average wage zip codes?
lowestavgwage <- wages %>% arrange(averageWage)
lowestavgwage <- lowestavgwage %>% mutate(rank = 1:28844)
lowestavgwage <- lowestavgwage %>% filter(rank <= 100)
print(head(lowestavgwage))
lowestavgwage %>% ggplot(aes(x = State,fill = State)) + geom_bar() + labs(title = "Number of Average Wage Zip Codes Per State in the Bottom 100" ,x = "State",y = "Frequency")

Similar to the lowest total wages, Michigan and Arizona are the highest contributors to the bottom 100 average zip code wages.
options(scipen = 999)
lowestavgwage %>% ggplot(aes(x = averageWage/1000)) + geom_density() + labs(title = "Density of Average Wages in the Bottom 100" ,x = "Total Wages (Thousand $)",y = "Density")

Here the majority of the average wages in the bottom 100 are around 7.5 thousand dollars.
What are the 3 states with the highest wages by total?
statewage <- wages %>% group_by(State) %>% summarise(
totalstatewages = sum(TotalWages))
highstatewage <- statewage %>% arrange(-totalstatewages)
highstatewage <- highstatewage %>% mutate(rank = 1:51)
highstatewage <- highstatewage %>% filter(rank <= 3)
print(head(highstatewage))
These results support the previous conclusion from Graph 1 where we saw that California Texas and New York were among the main contributors of the highest state wages. However these results give us further insight that Illinois is not actually among the top 3 highest wage states when looking at the total wages
What are the 3 states with the lowest wages by total?
lowstatewage <- statewage %>% arrange(totalstatewages)
lowstatewage <- lowstatewage %>% mutate(rank = 1:51)
lowstatewage <- lowstatewage %>% filter(rank <= 3)
print(head(lowstatewage))
What does the average of each state look like?
state_avg_wage <- wages %>%
group_by(State) %>%
summarise(avgstatewages = mean(TotalWages))
plot_geo(data = state_avg_wage,
locationmode = 'USA-states') %>%
add_trace(locations = ~State,
z = ~state_avg_wage$avgstatewages,
zmin = min(state_avg_wage$avgstatewages),
zmax = max(state_avg_wage$avgstatewages),
color = state_avg_wage$avgstatewages) %>%
layout(geo = list(scope= 'usa'),
title = "\nAverage Wages in the United States by State") %>% colorbar(tickprefix = "$")
The graph above shows that out of all the states, the highest wage location is Washington DC with an average wage of about $550 Million. The second two locations are California and New Jersey by a $150 million wage gap. In comparison to all other Locations, these three stand out as states and territories with high wages.
Importing Other Datasets From Online For Further Analysis
CostOfLiving <- read.csv("Cost of Living.csv")
StateAbrev <- read.csv("StateAbrev.csv")
Preparing Data For Joining
StateAbrev <- StateAbrev %>% rename(State = USPS.Abbreviation)
CostOfLiving <- CostOfLiving %>% rename(State.Name = State)
Joining the Datasets
CostOfLiving <- CostOfLiving %>% full_join(StateAbrev, by = "State.Name")
State_avg_vs_COL <- state_avg_wage %>% full_join(CostOfLiving, by = "State")
State_avg_vs_COL <- State_avg_vs_COL %>% select(-Rank,-State.Name)
Working With the New Dataset To Determine Potential Relationships
plot_geo(data = State_avg_vs_COL,
locationmode = 'USA-states') %>%
add_trace(locations = ~State,
z = ~State_avg_vs_COL$Index,
zmin = min(State_avg_vs_COL$Index),
zmax = max(State_avg_vs_COL$Index),
color = State_avg_vs_COL$Index) %>%
layout(geo = list(scope= 'usa'),
title = "\nCost of living Index in the United States by State")
Upon inspection of the Cost of Living map in comparison to the Averages Wages map, there are some clear trends. California and DC Remain in the top 3 in both maps. However Hawaii has a much higher cost of living than compared to its average wage. This is most likely due to its status as a “vacation state”. The rest of the states are kind of ambiguous when looking at the choropleth map. Further inspection of the correlation will give us an idea of the relationship.
Testing Correlation in order to quantify the relationship
cor(State_avg_vs_COL$Index, State_avg_vs_COL$avgstatewages, use = "pairwise.complete.obs")
[1] 0.5894002
Here we see that there is a moderately strong positive correlation between the two variables. Intuitively this is not a surprising discovery, however I will make a correlation matrix to see which of the factors that contribute to the Cost of living carry more weight when looking at the average wage in each state.
Correlation Matrix
cor_matrix <- State_avg_vs_COL %>%
select(-State) %>%
cor(use = "pairwise.complete.obs")
cor_matrix <- round(cor_matrix, digits = 2)
meltCorMat <- melt(cor_matrix)
meltCorMat %>% ggplot(aes(x = Var1, y = Var2, fill = value)) + geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 12, hjust = 1))+
coord_fixed() +
geom_text(aes(Var2, Var1, label = value), color = "black", size = 4)

Here we can see that the two biggest correlations other than The total cost of living index is the housing price index and a misc. index which I summarize to mean recreational activities and commodities such as eating out and entertainment systems.
Creating scatter map based on longitude and latitude
geo_prop <- list(scope = 'usa',
projection = list(type = 'albers usa'),
showland = TRUE,
showsubunits = TRUE,
landcolor = toRGB('gray10'),
showlakes = TRUE,
lakecolor = toRGB('white'))
plot_geo(wages,
lat = ~Lat,
lon = ~Long,
marker = list(size = wages$averageWage/15000),
text = wages$City) %>% layout(geo = geo_prop, title = "\nDensity and Intensity of the Average Wage for US Zip Codes")
No scattergeo mode specifed:
Setting the mode to markers
Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
No scattergeo mode specifed:
Setting the mode to markers
Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
The Map above shows all of the zip codes plotted via their latitude and longitude. The size or intensity of every point is proportional to how high the average wage in the zip code is. I just figured I would plot this because its a good looking graph and can express the range of zip codes left in the data after it had been cleaned.
Training A Linear Regression Model to Predict Average Wages In a Zip Code
Training the Model
LineWage <- wages %>% select(-State,-City,-ZipCodeType,-Zipcode)
set.seed(2)
split <- sample.split(LineWage,SplitRatio = 1/4)
train <- subset(LineWage, split = "TRUE")
test <- subset(LineWage, split = "FALSE")
model <- lm(averageWage~.,data = train)
summary(model)
Call:
lm(formula = averageWage ~ ., data = train)
Residuals:
Min 1Q Median 3Q Max
-34918 -3332 -717 1722 167551
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20594.1304072651 409.3614252313 50.308 < 0.0000000000000002 ***
Lat 29.3502510222 8.1391765199 3.606 0.000311 ***
Long 26.8223086918 2.8040993302 9.565 < 0.0000000000000002 ***
TaxReturnsFiled 0.3766591942 0.0677842110 5.557 0.0000000277 ***
EstimatedPopulation -1.3068716836 0.0397504079 -32.877 < 0.0000000000000002 ***
TotalWages 0.0000537401 0.0000003298 162.923 < 0.0000000000000002 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7090 on 28838 degrees of freedom
Multiple R-squared: 0.5078, Adjusted R-squared: 0.5077
F-statistic: 5951 on 5 and 28838 DF, p-value: < 0.00000000000000022
Testing the Model
predict <- predict(model, test)
Graphing for Accuracy
plot(predict, type = "l",col = "red") + lines(test$averageWage)
integer(0)

Calculating Root Mean Square Error for Accuracy
rmse <- sqrt(mean(predict-LineWage$averageWage)^2)
print(rmse)
[1] 0.00000000001681188
The error calculation is very low; this indicates a well trained model for future data.
---
title: "Wages Via Zip Code"
author: "Elijah Silfies"
date: "11/11/2021"
output: html_notebook
---
### Objectives
What are the 100 highest total wage zip codes? lowest?
what are the 100 highest average wage zip codes? lowest?
how does the 100 total wages of the zip codes differ from the 100 average wages of the zipcodes?
top 3 states with the highest/lowest wages?
Make clear concise graphs and compare to other types of existing data
Create a machine learning model to predict future data



### Load The Neccesary libraries
```{r}
library(tidyverse)
library(maps)
library(mapdata)
library(plotly)
library(caTools)
library(reshape2)
```

### Examine data
```{r}
wages <- read_csv("free-zipcode-database.csv")
glimpse(wages)
```

### Looking at United States only
```{r}
wages <- wages %>% filter(Country == "US")
```

### Removing unwanted variables
```{r}
wages <- wages %>% select(-RecordNumber,-Xaxis,-Yaxis,-Zaxis,-Location,-WorldRegion,-LocationText,-LocationType,-Country,-Decommisioned,-Notes)
```

### Removing Unwanted cases like puerto rico and NA values and duplicates
```{r}
unique(wages$State)

wages <- wages %>% filter(State != "PR")

for(i in 1:9){
  wages<- wages %>% filter(!is.na(wages[i]))
}

wages <- wages %>% distinct(Zipcode, .keep_all = TRUE)
```
We see a loss of about 50,000 values in the data set. About 25,000 lost from blank values and another 25,000 from duplicates. Since this is the majority of the data set, it is not a very clean data set and may not have a very accurate representation of the initial data. However, I will still plan to analyse the data in a comprehensive way in order to find answers to the proposed questions.

### Create an average wage column by dividing the total wage by the estimated population and formatting for the dollar amounts
```{r}
wages <- wages %>% mutate(averageWage = format(round(TotalWages/EstimatedPopulation,2), nsmall = 2))
wages <- wages %>% mutate(averageWage = TotalWages/EstimatedPopulation)
```

### EXporting a csv file in order to use the cleaned data in other programs i.e. Tableau
```{r}
write.table(wages,file = "CleanWages.csv",row.names = F,sep = ",")
```


### What is the 100 highest wage zip codes?
```{r}
options(scipen = 999)
highestwage <- wages %>% arrange(-TotalWages)
highestwage <- highestwage %>% mutate(rank = 1:28844)
highestwage <- highestwage %>% filter(rank <= 100)
print(head(highestwage))
```



```{r}

highestwage %>% ggplot(aes(x = State,fill = State)) + geom_bar() + labs(title = "Number of High Wage Zip Codes Per State in the Top 100" ,x = "State",y = "Frequency")
```
Looking at the graph of each state in the top 100 and the Total wages of the zip codes in each state, it is easy to see that California, New York, Texas, and Illinois contribute the majority of the zip codes with the highest wages.

```{r}
options(scipen = 999)
highestwage %>% ggplot(aes(x = TotalWages/1000000000)) + geom_density() + labs(title = "Density of Total Wages in the Top 100" ,x = "Total Wages (Billion $)",y = "Density") 
```
By taking a look at the graph of the density of each price we can see that the majority of zip codes in the top 100 highest wages are around 1.7 billion dollars. Also as you get past 1.9 billion dollars the amount of zip codes decreases steadily.

### What is the 100 lowest wage zip codes?
```{r}
lowestwage <- wages %>% arrange(TotalWages)
lowestwage <- lowestwage %>% mutate(rank = 1:28844)
lowestwage <- lowestwage %>% filter(rank <= 100)
print(head(lowestwage))
```

```{r}
lowestwage %>% ggplot(aes(x = State,fill = State)) + geom_bar() + labs(title = "Number of Low Wage Zip Codes Per State in the Bottom 100" ,x = "State",y = "Frequency")
```
Here we see that Michigan, Arizona, and Texas are the 3 most frequent states in the Bottom 100.


```{r}
options(scipen = 999)
lowestwage %>% ggplot(aes(x = TotalWages/1000000)) + geom_density() + labs(title = "Density of Total Wages in the Bottom 100" ,x = "Total Wages (Million $)",y = "Density") 
```
Similar to the last density chart, this graph predicts the majority of wages in the bottom 100 are around $4.5 million.


### What is the 100 highest average wage zip codes?
```{r}
highestavgwage <- wages %>% arrange(-averageWage)
highestavgwage <- highestavgwage %>% mutate(rank = 1:28844)
highestavgwage <- highestavgwage %>% filter(rank <= 100)
print(head(highestavgwage))
```

```{r}
highestavgwage %>% ggplot(aes(x = State,fill = State)) + geom_bar() + labs(title = "Number of Average Wage Zip Codes Per State in the Top 100" ,x = "State",y = "Frequency")
```
The average zip code wage in each state shows a different story than the total wages in each zip code. The two states that make up the majority are New York and California, where New York is much more frequent.

```{r}
options(scipen = 999)
highestavgwage %>% ggplot(aes(x = averageWage/1000)) + geom_density() + labs(title = "Density of Average Wages in the Top 100" ,x = "Total Wages (Thousand $)",y = "Density") 
```
The densities show that the majority of the average zip code wages in the top 100 are around 190 thousand dollars. 

### What is the 100 lowest average wage zip codes?
```{r}
lowestavgwage <- wages %>% arrange(averageWage)
lowestavgwage <- lowestavgwage %>% mutate(rank = 1:28844)
lowestavgwage <- lowestavgwage %>% filter(rank <= 100)
print(head(lowestavgwage))
```

```{r}
lowestavgwage %>% ggplot(aes(x = State,fill = State)) + geom_bar() + labs(title = "Number of Average Wage Zip Codes Per State in the Bottom 100" ,x = "State",y = "Frequency")
```
Similar to the lowest total wages, Michigan and Arizona are the highest contributors to the bottom 100 average zip code wages.


```{r}
options(scipen = 999)
lowestavgwage %>% ggplot(aes(x = averageWage/1000)) + geom_density() + labs(title = "Density of Average Wages in the Bottom 100" ,x = "Total Wages (Thousand $)",y = "Density") 
```
Here the majority of the average wages in the bottom 100 are around 7.5 thousand dollars.

### What are the 3 states with the highest wages by total?
```{r}
statewage <- wages %>% group_by(State) %>% summarise(
  totalstatewages = sum(TotalWages))
highstatewage <- statewage %>% arrange(-totalstatewages)
highstatewage <- highstatewage %>% mutate(rank = 1:51)
highstatewage <- highstatewage %>% filter(rank <= 3)
print(head(highstatewage))
```
These results support the previous conclusion from Graph 1 where we saw that California Texas and New York were among the main contributors of the highest state wages. However these results give us further insight that Illinois is not actually among the top 3 highest wage states when looking at the total wages


### What are the 3 states with the lowest wages by total?
```{r}
lowstatewage <- statewage %>% arrange(totalstatewages)
lowstatewage <- lowstatewage %>% mutate(rank = 1:51)
lowstatewage <- lowstatewage %>% filter(rank <= 3)
print(head(lowstatewage))
```


### What does the average of each state look like?
```{r}
state_avg_wage <- wages %>% 
  group_by(State) %>% 
  summarise(avgstatewages = mean(TotalWages))

plot_geo(data = state_avg_wage,
                      locationmode = 'USA-states') %>% 
  add_trace(locations = ~State,
            z = ~state_avg_wage$avgstatewages,
            zmin = min(state_avg_wage$avgstatewages), 
            zmax = max(state_avg_wage$avgstatewages),
            color = state_avg_wage$avgstatewages) %>% 
  layout(geo = list(scope= 'usa'),
         title = "\nAverage Wages in the United States by State") %>% colorbar(tickprefix = "$")
```
The graph above shows that out of all the states, the highest wage location is Washington DC with an average wage of about $550 Million. The second two locations are California and New Jersey by a $150 million wage gap. In comparison to all other Locations, these three stand out as states and territories with high wages.

### Showing the Highest Average Wages in Tabular Form 
```{r}
state_avg_wage <- state_avg_wage %>% 
  arrange(-avgstatewages ) 
head(state_avg_wage)
```

### Importing Other Datasets From Online For Further Analysis
```{r}
CostOfLiving <- read.csv("Cost of Living.csv")
StateAbrev <- read.csv("StateAbrev.csv")
```

### Preparing Data For Joining
```{r}
StateAbrev <- StateAbrev %>% rename(State = USPS.Abbreviation)
CostOfLiving <- CostOfLiving %>% rename(State.Name = State)
```

### Joining the Datasets
```{r}
CostOfLiving <- CostOfLiving %>% full_join(StateAbrev, by = "State.Name")
State_avg_vs_COL <- state_avg_wage %>% full_join(CostOfLiving, by = "State")
State_avg_vs_COL <- State_avg_vs_COL %>% select(-Rank,-State.Name)
```

### Working With the New Dataset To Determine Potential Relationships
```{r}
plot_geo(data = State_avg_vs_COL,
                      locationmode = 'USA-states') %>% 
  add_trace(locations = ~State,
            z = ~State_avg_vs_COL$Index,
            zmin = min(State_avg_vs_COL$Index), 
            zmax = max(State_avg_vs_COL$Index),
            color = State_avg_vs_COL$Index) %>% 
  layout(geo = list(scope= 'usa'),
         title = "\nCost of living Index in the United States by State")
```
Upon inspection of the Cost of Living map in comparison to the Averages Wages map, there are some clear trends. California and DC Remain in the top 3 in both maps. However Hawaii has a much higher cost of living than compared to its average wage. This is most likely due to its status as a "vacation state". The rest of the states are kind of ambiguous when looking at the choropleth map. Further inspection of the correlation will give us an idea of the relationship.

### Testing Correlation in order to quantify the relationship
```{r}
cor(State_avg_vs_COL$Index, State_avg_vs_COL$avgstatewages, use = "pairwise.complete.obs")
```
Here we see that there is a moderately strong positive correlation between the two variables. Intuitively this is not a surprising discovery, however I will make a correlation matrix to see which of the factors that contribute to the Cost of living carry more weight when looking at the average wage in each state.

### Correlation Matrix
```{r}
cor_matrix <- State_avg_vs_COL %>%
  select(-State) %>%
  cor(use = "pairwise.complete.obs")
cor_matrix <- round(cor_matrix, digits = 2)

meltCorMat <- melt(cor_matrix)

meltCorMat %>% ggplot(aes(x = Var1, y = Var2, fill = value)) + geom_tile(color = "white")+
 scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
   midpoint = 0, limit = c(-1,1), space = "Lab", 
   name="Pearson\nCorrelation") +
  theme_minimal()+ 
 theme(axis.text.x = element_text(angle = 45, vjust = 1, 
    size = 12, hjust = 1))+
 coord_fixed() +
  geom_text(aes(Var2, Var1, label = value), color = "black", size = 4)
```
Here we can see that the two biggest correlations other than The total cost of living index is the housing price index and a misc. index which I summarize to mean recreational activities and commodities such as eating out and entertainment systems.

### Creating scatter map based on longitude and latitude
```{r}
geo_prop <- list(scope = 'usa',
                 projection = list(type = 'albers usa'), 
                 showland = TRUE,
                 showsubunits = TRUE,
                 landcolor = toRGB('gray10'),
                 showlakes = TRUE, 
                 lakecolor = toRGB('white'))

plot_geo(wages, 
        lat = ~Lat,
        lon = ~Long,
        marker = list(size = wages$averageWage/15000),
        text = wages$City) %>% layout(geo = geo_prop, title = "\nDensity and Intensity of the Average Wage for US Zip Codes")
```
The Map above shows all of the zip codes plotted via their latitude and longitude. The size or intensity of every point is proportional to how high the average wage in the zip code is. I just figured I would plot this because its a good looking graph and can express the range of zip codes left in the data after it had been cleaned.



## Training A Linear Regression Model to Predict Average Wages In a Zip Code

### Training the Model
```{r}
LineWage <- wages %>% select(-State,-City,-ZipCodeType,-Zipcode)

set.seed(2)
split <- sample.split(LineWage,SplitRatio = 1/4)
train <- subset(LineWage, split = "TRUE")
test <- subset(LineWage, split = "FALSE")

model <- lm(averageWage~.,data = train)
summary(model)
```

### Testing the Model
```{r}
predict <- predict(model, test)
```

### Graphing for Accuracy
```{r}
plot(predict, type = "l",col = "red") + lines(test$averageWage)
```

### Calculating Root Mean Square Error for Accuracy
```{r}
rmse <- sqrt(mean(predict-LineWage$averageWage)^2)
print(rmse)
```
The error calculation is very low; this indicates a well trained model for future data.


### Sources

1. Kaggle:   https://www.kaggle.com/pavansanagapati/us-wages-via-zipcode
2. YourDictionary:   https://abbreviations.yourdictionary.com/articles/state-abbrev.html
3. MERIC:   https://meric.mo.gov/data/cost-living-data-series
